Abstract

Bilateral filters are widely used because of their edge-preserving
properties. The common use case is to manually choose a parametric
filter type, usually a Gaussian filter. In this paper, we generalize
the parametrization and, in particular, derive a gradient descent
algorithm so that the filter parameters can be learned from data. This
derivation makes it possible to learn high-dimensional linear filters
that operate in sparsely populated feature spaces. We build on the
permutohedral lattice construction for efficient filtering. The ability
to learn more general forms of high-dimensional filters can be used in
several diverse applications. First, we demonstrate the use in
applications where single filter applications are desired for runtime
reasons. Further, we show how this algorithm can be used to learn the
pairwise potentials in densely connected conditional random fields and
apply these to different image segmentation tasks. Finally, we
introduce layers of bilateral filters in CNNs and propose bilateral
neural networks for the use of high-dimensional sparse data. This view
provides new ways to encode model structure into network architectures.
A diverse set of experiments empirically validates the usage of general
forms of filters.

1. Introduction 

Image convolutions are basic operations for many image processing and
computer vision applications. In this paper we study the class of
bilateral filter convolutions and propose a general image adaptive
convolution that can be learned from data. The bilateral filter
[5, 45, 50] was originally introduced for the task of image denoising
as an edge-preserving filter. Since the bilateral filter contains the
spatial convolution as a special case, we will in the following
directly state the general case. Given an image
$v = (v_1, \ldots, v_n)$, $v_i \in \mathbb{R}^c$ with n pixels and c
channels, and for every pixel i, a d-dimensional feature vector
$f_i \in \mathbb{R}^d$ (e.g., the (x, y) position in the image,
$f_i = (x_i, y_i)^\top$), the bilateral filter computes

$v'_i = \sum_{j=1}^{n} w_{f_i, f_j} v_j$    (1)

for all i. Almost the entire literature refers to the bilateral filter
as a synonym of the Gaussian parametric form
$w_{f_i, f_j} = \exp(-\frac{1}{2} (f_i - f_j)^\top \Sigma^{-1} (f_i - f_j))$.
The features $f_i$ are most commonly chosen to be position $(x_i, y_i)$
and color $(r_i, g_i, b_i)$ or pixel intensity. To appreciate the
edge-preserving effect of the bilateral filter, consider the
five-dimensional feature $f = (x, y, r, g, b)^\top$. Two pixels i and j
have a strong influence $w_{f_i, f_j}$ on each other only if they are
close in position and color. At edges the color changes, therefore
pixels lying on opposite sides have low influence and thus this filter
does not blur across edges. This behaviour is sometimes referred to as
“image adaptive”, since the filter has a different shape when evaluated
at different locations in the image. More precisely, it is the
projection of the filter to the two-dimensional image plane that
changes; the filter values $w_{f_i, f_j}$ do not change. The filter
itself can be of c dimensions, $w_{f_i, f_j} \in \mathbb{R}^c$, in which
case the multiplication in Eq. (1) becomes an inner product. For the Gaussian
case the filter can be applied independently per channel. For 
an excellent review of image filtering we refer to [39]. 
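
To make Eq. (1) concrete, the following is a minimal brute-force sketch of the Gaussian case on a 1D signal with (position, intensity) features. It is an illustration, not the method developed in this paper: the function names, the diagonal covariance, the toy signal and the division by the weight sum (a common normalization that Eq. (1) does not state) are all assumptions.

```python
import math

def gaussian_weight(f_i, f_j, inv_var):
    """Gaussian weight with a diagonal covariance; inv_var holds 1/sigma^2 per feature."""
    q = sum(w * (a - b) ** 2 for a, b, w in zip(f_i, f_j, inv_var))
    return math.exp(-0.5 * q)

def bilateral_filter(values, features, inv_var):
    """Brute-force O(n^2) evaluation of v'_i = sum_j w(f_i, f_j) v_j.

    The result is divided by the weight sum (an assumed normalization).
    """
    out = []
    for f_i in features:
        weights = [gaussian_weight(f_i, f_j, inv_var) for f_j in features]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values)) / total)
    return out

# Toy 1D step edge; each pixel has the feature (position, intensity).
signal = [0.0, 0.1, 0.0, 0.1, 1.0, 0.9, 1.0, 0.9]
feats = [(float(x), v) for x, v in enumerate(signal)]

# Position-only features give an ordinary Gaussian blur that smears the edge.
spatial_only = bilateral_filter(signal, [(f[0],) for f in feats], [1.0 / 2.0 ** 2])
# Adding intensity as a feature keeps the two sides of the edge apart.
bilateral = bilateral_filter(signal, feats, [1.0 / 2.0 ** 2, 1.0 / 0.2 ** 2])
```

With position as the only feature, the pixel just left of the step is pulled towards the bright side; with intensity added as a second feature, its value stays close to the dark side, which is the edge-preserving effect described above.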

The filter operation of Eq. (1) is a sparse high-dimensional
convolution, a view first developed in [40]. An image v is not sparse
in the spatial domain: we observe pixel values for all locations
(x, y). However, when pixels are understood in a higher dimensional
feature space, e.g., (x, y, r, g, b), the image becomes a sparse
signal, since the r, g, b values lie scattered in this five-dimensional
space. This view on filtering is the key difference of the bilateral
filter compared to the common spatial convolution. An image edge is not
“visible” for a filter in the spatial domain alone, whereas in the 5D
space it is. The edge-preserving behaviour is possible due to the
higher dimensional operation. Other data can naturally be understood as
sparse signals, e.g., 3D surface points.

The contribution of this paper is to propose a general and learnable
sparse high-dimensional convolution. Our technique builds on efficient
algorithms that have been developed to approximate the Gaussian
bilateral filter and re-uses them for more general high-dimensional
filter operations. Due to its practical importance (see related work in
Sec. 2), several efficient algorithms for computing Eq. (1) have been
developed, including the bilateral grid [40], Gaussian KD-trees [3],
and the permutohedral lattice [2]. The design goal for these algorithms
was to provide a) fast runtimes and b) small approximation errors for
the Gaussian filter case. The key insight of this paper is to use the
permutohedral lattice not as an approximation of a predefined kernel
but to freely parametrize its values. We relax the separable Gaussian
filter case from [2] and show how to compute gradients of the
convolution (Sec. 3) in lattice space. This enables learning the filter
from data.

This insight has several useful consequences. We discuss applications
where the bilateral filter has been used before: image filtering
(Sec. 4) and CRF inference (Sec. 5). Further, we demonstrate how the
free parametrization of the filters enables us to use them in deep
convolutional neural networks (CNNs) and allows convolutions that go
beyond the regular spatially connected receptive fields (Sec. 6). For
all domains, we present various empirical evaluations with a wide range
of applications.

2. Related Work 

We categorize the related work according to the three 
different generalizations of this work. 

Image Adaptive Filtering: The literature in this area is rich and we
can only provide a brief overview. Important classes of image adaptive
filters include the bilateral filters [5, 50, 45], non-local means
[13, 6], locally adaptive regressive kernels [47], guided image filters
[26] and propagation filters [42]. The kernel least-squares regression
problem can serve as a unified view of many of them [39]. In contrast
to the present work, which learns the filter kernel using supervised
learning, all these filtering schemes use a predefined kernel. Because
of the importance of bilateral filtering to many applications in image
processing, much effort has been devoted to deriving fast algorithms,
most notably [40, 2, 3, 22]. Surprisingly, the only attempt to learn
the bilateral filter we found is [27], which casts the learning problem
in the spatial domain by rearranging pixels. However, the learned
filter does not necessarily obey the full region of influence of a
pixel as in the case of a bilateral filter. The bilateral filter has
also been proposed to regularize a large set of applications in [9, 8],
where the respective optimization problems are parametrized in a
bilateral space. These works provide gradients to use the filters as
part of a learning system but, unlike this work, restrict the filters
to be Gaussian.


Dense CRF: The key observation of [33] is that mean-field inference
update steps in densely connected CRFs with Gaussian edge potentials
require Gaussian bilateral filtering operations. This enables tractable
inference through the application of a fast filter implementation from
[2]. This quickly found widespread use, e.g., the combination of CNNs
with a dense CRF is among the best performing segmentation models
[16, 55, 11]. These works combine structured prediction frameworks on
top of CNNs to model the relationship between the desired output
variables, thereby significantly improving upon the CNN result.
Bilateral neural networks, which are presented in this work, provide a
principled framework for encoding the output relationship using the
feature transformation inside the network itself, thereby alleviating
some of the need for later processing. Several works [34, 18, 31]
demonstrate how to learn free parameters of the dense CRF model.
However, the parametric form of the pairwise term always remains a
Gaussian. Campbell et al. [15] embed complex pixel dependencies into a
Euclidean space and use a Gaussian filter for pairwise connections.
This embedding is a pre-processing step and cannot be learned directly.
In Sec. 5 we will discuss how to learn the pairwise potentials while
retaining the efficient inference strategy of [33].


Neural Networks: In recent years, tremendous progress in a wide range
of computer vision applications has been observed due to the use of
CNNs. Most CNN architectures use spatial convolution layers, which have
fixed local receptive fields. This work suggests replacing these layers
with bilateral filters, which have a varying spatial receptive field
depending on the image content. The equivalent representation of the
filter in a higher dimensional space leads to sparse samples that are
handled by a permutohedral lattice data structure. Similarly, Bruna et
al. [12] propose convolutions on irregularly sampled data. Their graph
construction is closely related to the high-dimensional convolution
that we propose and defines weights on local neighborhoods of nodes.
However, the structure of the graph is bound to be fixed and it is not
straightforward to add new samples. Furthermore, re-using the same
filter among neighborhoods is only possible with their costly spectral
construction. Both cases are handled naturally by our sparse
convolution. Jaderberg et al. [29] propose a spatial transformation of
signals within the neural network to learn invariances for a given
task. The work of [28] proposes matrix backpropagation techniques which
can be used to build specialized structural layers such as normalized
cuts. Graham et al. [24] propose extensions from 2D CNNs to 3D sparse
signals. Our work enables sparse 3D filtering as a special case, since
we use an algorithm that allows for even higher dimensional data.




Figure 1. Schematic of the permutohedral convolution. Left: splatting
the input points (orange) onto the lattice corners (black); Middle: the
extent of a filter on the lattice with an s = 2 neighborhood (white
circles); for reference we show a Gaussian filter, with its values
color coded. The general case has a free scalar/vector parameter per
circle. Right: the result of the convolution at the lattice corners
(black) is projected back to the output points (blue). Note that in
general the output and input points may be different.

3. Learning Sparse High Dimensional Filters 

In this section, we describe the main technical contribution of this
work: we generalize the permutohedral convolution [2] and show how the
filter can be learned from data.

Recall the form of the bilateral convolution from Eq. (1). A naive
implementation would compute for every pixel i all associated filter
values $w_{f_i, f_j}$ and perform the summation independently. The view
of w as a linear filter in a higher dimensional space, as proposed by
[40], opened the way for new algorithms. Here, we will build on the
permutohedral lattice convolution developed in Adams et al. [2] for
approximate Gaussian filtering.

3.1. Permutohedral Lattice Convolutions 

We first review the permutohedral lattice convolution for 
Gaussian bilateral filters from Adams et al. [2] and describe 
its most general case. 

As before, we assume that every image pixel i is associated with a
d-dimensional feature vector $f_i$. Gaussian bilateral filtering using
a permutohedral lattice approximation involves 3 steps. We begin with
an overview of the algorithm, then discuss each step in more detail in
the next paragraphs. Fig. 1 schematically shows the three operations
for 2D features. First, interpolate the image signal on the
d-dimensional grid plane of the permutohedral lattice, which is called
splatting. A permutohedral lattice is the tessellation of space into
permutohedral simplices. We refer to [2] for details of the lattice
construction and its properties. In Fig. 2, we visualize the
permutohedral lattice in the image plane, where every simplex cell
receives a different color. All pixels of the same lattice cell have
the same color. Second, convolve the signal on the lattice. And third,
retrieve the result by interpolating the signal at the original
d-dimensional feature locations, called slicing. For example, if the
features used are a combination of position and color,
$f_i = (x_i, y_i, r_i, g_i, b_i)^\top$, the input signal is mapped into
the 5D cross product space of position and color and then convolved
with a 5D tensor. The result is then mapped
back to the original space. In practice we use a feature scaling $Df$
with a diagonal matrix D and use separate scales for position and color
features. The scale determines the distance of points and thus the size
of the lattice cells. More formally, the computation is written as
$v' = S_{slice} B S_{splat} v$ and all involved matrices are defined
below. For notational convenience we will assume scalar input signals
$v_i$; the vector-valued case is analogous: the lattice convolution
changes from scalar multiplications to inner products.

(a) Sample Image (b) Position (c) Color (d) Position, Color

Figure 2. Visualization of the Permutohedral Lattice. (a) Input image;
lattice visualizations for different feature spaces: (b) 2D position
features: 0.01(x, y), (c) color features: 0.01(r, g, b) and (d)
position and color features: 0.01(x, y, r, g, b). All pixels falling in
the same simplex cell are shown with the same color.


Splat: The splat operation (cf. left-most image in Fig. 1) finds the
enclosing simplex in $O(d^2)$ on the lattice of a given pixel feature
$f_i$ and distributes its value $v_i$ onto the corners of the simplex.
How strongly a pixel contributes to a corner j is defined by its
barycentric coordinate $b_{i,j} \in \mathbb{R}$ inside the simplex.
Thus, the value $\ell_j \in \mathbb{R}$ at a lattice point j is
computed by summing over all enclosed input points; more precisely, we
define an index set $J_i$ for a pixel i, which contains all the lattice
points j of the enclosing simplex:

$\ell = S_{splat} v; \quad (S_{splat})_{j,i} = b_{i,j}$ if $j \in J_i$, otherwise 0.    (2)

Convolve: The permutohedral convolution is defined on the lattice
neighborhood $N_s(j)$ of lattice point j, e.g., only s grid hops away.
More formally

$\ell' = B \ell; \quad (B)_{j',j} = w_{j,j'}$ if $j' \in N_s(j)$, otherwise 0.    (3)

An illustration of a two-dimensional permutohedral filter is shown in
Fig. 1 (middle). Note that we already presented the convolution in the
general form that we will make use of. The work of [2] chooses the
filter weights such that the resulting operation approximates a
Gaussian blur, which is illustrated in Fig. 1. Further, the algorithm
of [2] takes advantage of the separability of the Gaussian kernel.
Since we are interested in the most general case, we extended the
convolution to include non-separable filters B.


Slice: The slice operation (cf. right-most image in Fig. 1) computes an
output value $v'_{i'}$ for an output pixel i', again based on its
barycentric coordinates $b_{i',j}$, and sums over the corner points j
of its lattice simplex:

$v' = S_{slice} \ell'; \quad (S_{slice})_{i,j} = b_{i,j}$ if $j \in J_i$, otherwise 0.    (4)

The splat and slice operations take the role of an interpolation
between the different signal representations: the irregular and sparse
distribution of pixels with their associated feature vectors and the
regular structure of the permutohedral lattice points. Since
high-dimensional spaces are usually sparse, performing the convolution
densely on all lattice points is inefficient. So, for speed reasons, we
keep track of the populated lattice points using a hash table and only
convolve at those locations.
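
The three steps can be sketched in a few lines for the simplest possible setting. The code below is a simplified illustration and not the implementation of [2]: it assumes 1D features, where the lattice reduces to the integer points, each simplex is an interval, and the barycentric coordinates are the usual linear interpolation weights. All function names are hypothetical.

```python
import math
from collections import defaultdict

def barycentric_1d(f):
    """Enclosing 1D simplex (an interval) of f and the barycentric weights inside it."""
    j = math.floor(f)
    t = f - j
    return [(j, 1.0 - t), (j + 1, t)]

def splat(values, features):
    """Eq. (2): accumulate each value on the corners of its enclosing simplex."""
    lattice = defaultdict(float)  # hash table that stores populated lattice points only
    for v, f in zip(values, features):
        for j, b in barycentric_1d(f):
            lattice[j] += b * v
    return lattice

def convolve(lattice, kernel):
    """Eq. (3): filter with free weights; kernel[k + s] is the weight for offset k."""
    s = len(kernel) // 2
    return {j: sum(kernel[k + s] * lattice.get(j + k, 0.0) for k in range(-s, s + 1))
            for j in lattice}  # visit populated lattice points only

def slice_(lattice, features):
    """Eq. (4): read out by barycentric interpolation at (possibly new) feature locations."""
    return [sum(b * lattice.get(j, 0.0) for j, b in barycentric_1d(f)) for f in features]

def lattice_filter(values, in_features, out_features, kernel):
    """Splat, convolve and slice in sequence."""
    return slice_(convolve(splat(values, in_features), kernel), out_features)

# An identity kernel reduces the pipeline to linear interpolation.
interpolated = lattice_filter([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], [0.5], [0.0, 1.0, 0.0])
# A kernel of ones sums the s = 1 neighborhood on the lattice.
summed = lattice_filter([1.0, 2.0, 3.0], [0.0, 1.0, 2.0], [1.0], [1.0, 1.0, 1.0])
```

Because the kernel is an ordinary list of numbers, nothing in this sketch ties it to a Gaussian shape. The output features may also differ from the input features, as in the first call.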

3.2. Learning Permutohedral Filters 

The fixed set of filter weights w from [2] in Eq. (3) is designed to
approximate a Gaussian filter. However, the convolution kernel w can
naturally be understood as a general filtering operation in the
permutohedral lattice space with free parameters. In the exposition
above we already presented this general case. As we will show in more
detail later, this modification has non-trivial consequences for
bilateral filters, CNNs and probabilistic graphical models.

The size of the neighborhood $N_s(k)$ for the blur in Eq. (3) compares
to the filter size of a spatial convolution. The filtering kernel of a
common spatial convolution that considers s points to either side in
all dimensions has $(2s+1)^d \in O(s^d)$ parameters. A comparable
filter on the permutohedral lattice with an s neighborhood is specified
by $(s+1)^{d+1} - s^{d+1} \in O(s^d)$ elements (cf. Sec. A). Thus, both
share the same asymptotic size.
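
As a quick numerical check of the two counts, the sketch below evaluates $(2s+1)^d$ for a spatial filter and $(s+1)^{d+1} - s^{d+1}$ for a permutohedral filter. The function names are hypothetical and only serve this illustration.

```python
def spatial_filter_size(s, d):
    """Number of weights of a dense spatial filter reaching s points to either side."""
    return (2 * s + 1) ** d

def permutohedral_filter_size(s, d):
    """Number of lattice points in an s-neighborhood of the permutohedral lattice."""
    return (s + 1) ** (d + 1) - s ** (d + 1)

for d in (2, 3, 5):
    for s in (1, 2):
        print(f"d={d} s={s} spatial={spatial_filter_size(s, d)} "
              f"permutohedral={permutohedral_filter_size(s, d)}")
```

For d = 2 and s = 2 this gives 25 spatial weights versus 19 lattice weights. The 19 points are the centre plus the two hexagonal rings of 6 and 12 points, consistent with the s = 2 neighborhood drawn in Fig. 1.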

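By computing the gradients of the filter elements we enable the use of
gradient-based optimizers, e.g., backpropagation for CNNs, in the same
way that spatial filters in a CNN are learned. The gradients with
respect to v and the filter weights in B of a scalar loss L are:

$\frac{\partial L}{\partial v} = S_{splat}^\top B^\top S_{slice}^\top \frac{\partial L}{\partial v'}$    (5)

$\frac{\partial L}{\partial (B)_{i,j}} = \left( S_{slice}^\top \frac{\partial L}{\partial v'} \right)_i (S_{splat} v)_j$.    (6)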

Both gradients are needed during backpropagation, and in experiments we
use stochastic backpropagation for learning the filter kernel. The
permutohedral lattice convolution is parallelizable and scales linearly
with the filter size. Specialized implementations run at interactive
speeds in image processing applications [2]. Our implementation allows
arbitrary filter parameters and the computation of the gradients on
both CPU and GPU. It is also integrated into the Caffe deep learning
framework [30].
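
The gradients in Eqs. (5) and (6) follow from the chain rule applied to $v' = S_{slice} B S_{splat} v$. The sketch below spells them out for small dense matrices stored as Python lists so that they can be compared against finite differences. It is an illustration under assumptions: the squared-error loss, the toy matrices and all names are hypothetical, and an actual implementation would use the sparse hash-table representation rather than dense matrices.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def forward(S_slice, B, S_splat, v):
    """v' = S_slice B S_splat v."""
    return matvec(S_slice, matvec(B, matvec(S_splat, v)))

def loss(v_out, target):
    """Illustrative loss (an assumption): half the squared error."""
    return 0.5 * sum((a - b) ** 2 for a, b in zip(v_out, target))

def gradients(S_slice, B, S_splat, v, target):
    lattice = matvec(S_splat, v)                       # splatted signal, S_splat v
    v_out = matvec(S_slice, matvec(B, lattice))
    dL_dvout = [a - b for a, b in zip(v_out, target)]  # dL/dv' for the squared error
    back = matvec(transpose(S_slice), dL_dvout)        # S_slice^T dL/dv'
    dL_dv = matvec(transpose(S_splat), matvec(transpose(B), back))  # Eq. (5)
    dL_dB = [[g * l for l in lattice] for g in back]                # Eq. (6)
    return dL_dv, dL_dB

# Toy setup: 2 pixels, 3 lattice points; columns of S_splat hold barycentric weights.
S_splat = [[0.5, 0.0], [0.5, 0.25], [0.0, 0.75]]
B = [[0.6, 0.2, 0.0], [0.2, 0.6, 0.2], [0.0, 0.2, 0.6]]
S_slice = [[0.5, 0.5, 0.0], [0.0, 0.25, 0.75]]
v = [1.0, 2.0]
target = [0.5, 1.5]
dL_dv, dL_dB = gradients(S_slice, B, S_splat, v, target)
```

Perturbing a single entry of B or v and re-evaluating the loss reproduces the corresponding entry of `dL_dB` or `dL_dv` up to the finite-difference error.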

4. Single Bilateral Filter Applications 

In this section we will consider the problem of joint bilateral
upsampling [32] as a prominent instance of a single bilateral filter
application. See [41] for a recent overview of other bilateral filter
applications. Further experiments on image denoising and 3D body mesh
denoising are included in Sec. B.3, together with details about exact
experimental protocols and more visualizations.


Upsampling factor    Bicubic          Gaussian         Learned

Color Upsampling (PSNR)
2x                   24.19 / 30.59    33.46 / 37.93    34.05 / 38.74
4x                   20.34 / 25.28    31.87 / 35.66    32.28 / 36.38
8x                   17.99 / 22.12    30.51 / 33.92    30.81 / 34.41
16x                  16.10 / 19.80    29.19 / 32.24    29.52 / 32.75

Depth Upsampling (RMSE)
8x                   0.753            0.753            0.748

Table 1. Joint bilateral upsampling. (top) PSNR values corresponding to
various upsampling factors and upsampling strategies on the test images
of the Pascal VOC12 segmentation / high-resolution 2MP dataset;
(bottom) RMSE values corresponding to upsampling depth images estimated
using [19], computed on the test images from the NYU depth dataset [43].

4.1. Joint Bilateral Upsampling 

A typical technique to speed up computer vision algorithms is to
compute results on a lower scale and upsample the result to the full
resolution. This upsampling step may use the original resolution image
as a guidance image. A joint bilateral upsampling approach for this
problem setting was developed in [32]. We describe the procedure for
the example of upsampling a color image. Given a high-resolution
grayscale image (the guidance image) and the same image at a lower
resolution but in color, the task is to upsample the color image to the
same resolution as the guidance image. Using the permutohedral lattice,
joint bilateral upsampling proceeds by splatting the color image into
the lattice, using 2D position and 1D intensity as features and the 3D
RGB values as the signal. A convolution is applied in the lattice and
the result is read out at the pixels of the high-resolution image, that
is, using the 2D position and intensity of the guidance image. The
possibility of reading out (slicing) points that are not necessarily
the input points is an appealing feature of the permutohedral lattice
convolution.
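
The idea of reading out at locations other than the input points can be illustrated without the lattice. The following brute-force sketch performs Gaussian joint bilateral upsampling on a 1D toy signal; it replaces the lattice-based procedure described above with a direct weighted average. The function name, the feature scales, the toy data and the choice to take the low-resolution intensity features from the guidance signal are all assumptions made for this illustration.

```python
import math

def joint_bilateral_upsample(low_values, low_feats, high_feats, sigmas):
    """For each high-res feature, average the low-res values with Gaussian weights."""
    out = []
    for f_hi in high_feats:
        num, den = 0.0, 0.0
        for v, f_lo in zip(low_values, low_feats):
            q = sum(((a - b) / s) ** 2 for a, b, s in zip(f_hi, f_lo, sigmas))
            w = math.exp(-0.5 * q)
            num += w * v
            den += w
        out.append(num / den if den > 0.0 else 0.0)
    return out

# High-resolution guidance intensities with an edge between positions 3 and 4.
guide = [0.1, 0.1, 0.1, 0.1, 0.9, 0.9, 0.9, 0.9]
high_feats = [(float(x), g) for x, g in enumerate(guide)]

# Low-resolution signal: only positions 0 and 4 carry a value.
low_positions = [0, 4]
low_values = [10.0, 50.0]
low_feats = [(float(x), guide[x]) for x in low_positions]

# Features are (position, intensity); sigmas are the assumed feature scales.
upsampled = joint_bilateral_upsample(low_values, low_feats, high_feats, sigmas=(4.0, 0.1))
```

Position 3 is spatially closer to the sample at position 4, but its guidance intensity matches the sample at position 0, so it receives that sample's value and the upsampled signal keeps the edge of the guidance image.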

4.1.1 Color Upsampling 

For the task of color upsampling, we compare the Gaussian bilateral
filter [32] against a learned generalized filter. We experimented with
two different datasets: one is the Pascal VOC2012 segmentation dataset
[20] using the pre-defined train, validation and test splits, and the
other is a dataset of 200 high-resolution (2MP) images collected using
Google image search [1] with 100 train, 50 validation and 50 test
images. For training we use the mean squared error (MSE) criterion and
perform stochastic gradient descent with a momentum term of 0.9 and
weight decay of 0.0005. The learning rate and feature scales D have
been cross-validated using the validation split. The Gaussian and the
learned filter have the same support. In Table 1 we report results in
terms of PSNR for the upsampling factors 2×, 4×, 8× and 16×. We compare
a standard bicubic interpolation that does not use a guidance image,
the Gaussian bilateral filter case (with feature scales optimized on
the validation set), and the learned filter. For all upsampling
factors, we observe that joint bilateral Gaussian upsampling
outperforms bicubic interpolation. Learning a general convolution
further improves over the Gaussian filter case. A result of the
upsampling is shown in Fig. 3 and more results are included in
Sec. C.2. The learned filter recovers finer details in the images.

(a) Input (b) Guidance (c) Ground Truth (d) Bicubic Interpolation (e) Gauss Bilateral (f) Learned Bilateral

Figure 3. Guided Upsampling. Color (top) and depth (bottom) 8x upsampling results using different methods (best viewed on screen).
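
For reference, PSNR is derived from the mean squared error as $10 \log_{10}(\mathrm{MAX}^2 / \mathrm{MSE})$. The sketch below computes it for flat lists of pixel values. The peak value of 255 is an assumption for 8-bit images; the exact evaluation protocol behind Table 1 is not specified here.

```python
import math

def mse(reference, estimate):
    """Mean squared error between two equally long sequences of pixel values."""
    return sum((x - y) ** 2 for x, y in zip(reference, estimate)) / len(reference)

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; max_value = 255 assumes 8-bit images."""
    err = mse(reference, estimate)
    if err == 0.0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / err)
```

Since PSNR is a monotone function of the MSE, training with the MSE criterion directly targets the reported metric.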

4.1.2 Depth Upsampling 

We use the work on recovering depth estimates from single RGB images
[19] to validate the learned filter on another domain for the task of
joint upsampling. The dataset of [43] provides a benchmark for this
task and comes with pre-defined train, validation and test splits that
we re-use. The approach of [19] is a CNN model that produces a result
at 1/4th of the input resolution due to down-sampling operations in
max-pooling layers. Furthermore, the authors downsample the 640 × 480
images to 320 × 240 as a pre-processing step before CNN convolutions.
The final depth result is bicubically interpolated to the original
resolution. It is this interpolation that we replace with a Gaussian
and a learned joint bilateral upsampling. The features are
five-dimensional position and color information from the
high-resolution input image. The filter is learned using the same
protocol as for color upsampling, minimizing MSE prediction error. The
quantitative results are shown in Table 1: the Gaussian filter performs
on par with the bicubic interpolation (p-value 0.311), while the
learned filter is better (p-value 0.015). Qualitative results are shown
in Fig. 3; both joint bilateral upsampling variants respect image edges
in the result. For this [21] and other tasks, specialized interpolation
algorithms exist, e.g., deconvolution networks [54]. Part of future
work is to equip these approaches with bilateral filters.

5. Learning Pairwise Potentials in Dense CRFs 

The bilateral convolution from Sec. 3 generalizes the class of dense
CRF models for which the mean-field inference from [33] applies. The
dense CRF models have found widespread use in various computer vision
applications [46, 10, 55, 52, 51, 11]. Let us briefly review the dense
CRF and then discuss its generalization.

5.1. Dense CRF Review 

Consider a fully pairwise connected CRF with discrete variables over
each pixel in the image. For an image with n pixels, we have random
variables $X = \{X_1, X_2, \ldots, X_n\}$ with discrete values
$\{x_1, \ldots, x_n\}$ in the label space $x_i \in \mathcal{L}$. The
dense CRF model has unary potentials $\psi_u(x_i) \in \mathbb{R}$,
e.g., these can be the output of CNNs. The pairwise potentials in [33]
are of the form $\psi_p^{ij}(x_i, x_j) = \mu(x_i, x_j) k(f_i, f_j)$,
where $\mu$ is a label compatibility matrix, k is a Gaussian kernel
$k(f_i, f_j) = \exp(-(f_i - f_j)^\top \Sigma^{-1} (f_i - f_j))$ and the
vectors $f_i$ are feature vectors, just like the ones used for the
bilateral filtering, e.g., (x, y, r, g, b). The Gibbs distribution for
an image v thus reads
$p(x|v) \propto \exp(-\sum_i \psi_u(x_i) - \sum_{i>j} \psi_p^{ij}(x_i, x_j))$.
Because of dense connectivity, exact MAP or marginal inference is
intractable. The main result of [33] is to derive the mean-field
approximation of this model and to relate it to bilateral filtering,
which enables tractable approximate inference. Mean-field approximates
the model p with a fully factorized distribution $q = \prod_i q_i(x_i)$
and solves for q by minimizing their KL divergence $KL(q \| p)$. This
results in a fixed point equation which can be solved iteratively,
t = 0, 1, ..., to update the marginal distributions $q_i$,

$q_i^{t+1}(x_i) = \frac{1}{Z_i} \exp\{ -\psi_u(x_i) - \sum_{l \in \mathcal{L}} \underbrace{\sum_{j \neq i} \psi_p^{ij}(x_i, l) \, q_j^t(l)}_{\text{bilateral filtering}} \}$    (7)

for all i. Here, $Z_i$ ensures normalization of the distribution $q_i$.
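
A single mean-field update of this form can be written down directly for a handful of pixels. The sketch below is a brute-force illustration, not the efficient lattice-based inference of [33]: it evaluates the Gaussian kernel explicitly for all pairs, uses a Potts compatibility (1 for different labels, 0 otherwise) and a diagonal inverse covariance. These choices, the toy data and all names are assumptions.

```python
import math

def gaussian_kernel(f_i, f_j, inv_var):
    """k(f_i, f_j) = exp(-(f_i - f_j)^T Sigma^{-1} (f_i - f_j)) with diagonal Sigma^{-1}."""
    return math.exp(-sum(w * (a - b) ** 2 for a, b, w in zip(f_i, f_j, inv_var)))

def potts(label_a, label_b):
    """Assumed label compatibility: penalize different labels only."""
    return 1.0 if label_a != label_b else 0.0

def normalize_neg_energy(energy):
    """Return exp(-energy) / Z, shifted by the minimum for numerical stability."""
    m = min(energy)
    unnorm = [math.exp(-(e - m)) for e in energy]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def mean_field_step(q, unary, features, inv_var, compat=potts):
    n, num_labels = len(q), len(q[0])
    # "Bilateral filtering" term: filtered[i][l] = sum_{j != i} k(f_i, f_j) q_j(l)
    filtered = [[sum(gaussian_kernel(features[i], features[j], inv_var) * q[j][l]
                     for j in range(n) if j != i)
                 for l in range(num_labels)]
                for i in range(n)]
    new_q = []
    for i in range(n):
        energy = [unary[i][x] + sum(compat(x, l) * filtered[i][l] for l in range(num_labels))
                  for x in range(num_labels)]
        new_q.append(normalize_neg_energy(energy))
    return new_q

# Four pixels with a 1D feature; pixel 1 has an ambiguous unary, pixel 3 is far away.
features = [(0.0,), (0.1,), (0.2,), (5.0,)]
unary = [[0.0, 2.0], [0.6, 0.4], [0.0, 2.0], [2.0, 0.0]]
q0 = [normalize_neg_energy(u) for u in unary]   # initialize with the unaries
q1 = mean_field_step(q0, unary, features, inv_var=[1.0])
```

In this toy example pixel 1 initially favours label 1 slightly, but its two close neighbours favour label 0, so one update moves its marginal towards label 0. The distant pixel 3 is essentially unaffected.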

5.2. Learning Pairwise Potentials 

The proposed bilateral convolution generalizes the class of potential
functions since it allows a richer class of kernels $k(f_i, f_j)$ that
furthermore can be learned from data. So far, all dense CRF models have
used Gaussian potential functions k; we replace them with the general
bilateral convolution and learn the parameters of the kernel k, thus in
effect learning the pairwise potentials of the dense CRF. This retains
the desirable properties of this model class: efficient inference
through mean-field and the feature dependency of the pairwise
potential. The benefit of the Gaussian pairwise potentials is to encode
similarity between pixels through distance and appearance; in other
words, pixels with similar (r, g, b) values are likely to be of the
same class. In order to learn the form of the pairwise potentials k we
make use of the gradients for filter parameters in k and use
back-propagation through the mean-field iterations [18, 37] to learn
them.

(a) Input (b) Ground Truth (c) CNN (d) +looseMF

Figure 4. Segmentation results. An example result for semantic (top)
and material (bottom) segmentation. (c) depicts the unary results
before application of MF, (d) after two steps of loose-MF with a
learned CRF. More examples with comparisons to Gaussian pairwise
potentials are in Sec. C.5 and Sec. C.6.

The work of [34] derived gradients to learn the feature scaling D but
not the form of the kernel k, which still was Gaussian. In [15], the
features $f_i$ were derived using a non-parametric embedding into a
Euclidean space and again a Gaussian kernel was used. The computation
of the embedding was a pre-processing step, not integrated in an
end-to-end learning framework. Both aforementioned works are
generalizations that are orthogonal to our development and can be used
in conjunction.

5.3. Experimental Evaluation 

We evaluate the effect of learning more general forms of potential
functions on two pixel labeling tasks, semantic segmentation of VOC
data [20] and material classification [11]. We use pre-trained models
from the literature and empirically evaluate the relative change when
learning the pairwise potentials. For both experiments, we use a
multinomial logistic classification loss and learn the filters via
back-propagation.

5.3.1 Semantic Segmentation 

Semantic segmentation is the task of assigning a semantic label (e.g.,
car, person, etc.) to every pixel. We choose the DeepLab network [16],
a variant of the VGGnet [44], for obtaining unaries. The DeepLab
architecture runs a CNN model on the input image to obtain a result
that is down-sampled by a factor of 8. The result is then bilinearly
interpolated to the desired resolution and serves as unaries
$\psi_u(x_i)$ in a dense CRF. We use the same Potts label compatibility
function $\mu$, and also use two kernels
$k^1(f_i, f_j) + k^2(p_i, p_j)$ with the same features
$f_i = (x_i, y_i, r_i, g_i, b_i)^\top$ and $p_i = (x_i, y_i)^\top$ as
in [16]. Thus, the two filters operate in parallel on color & position,
and spatial domain respectively. We also initialize the mean-field
update equations with the CNN unaries. The only change in the model is
the type of the pairwise potential function from Gauss to a generalized
form.

                 + MF-1 step       + MF-2 step       + loose MF-2 step

Semantic segmentation (IoU) - CNN [16]: 72.08 / 66.95
Gauss CRF        +2.48             +3.38             +3.38 / +3.00
Learned CRF      +2.93             +3.71             +3.85 / +3.37

Material segmentation (Pixel Accuracy) - CNN [11]: 67.21 / 69.23
Gauss CRF        +7.91 / +6.28     +9.68 / +7.35     +9.68 / +7.35
Learned CRF      +9.48 / +6.23     +11.89 / +6.93    +11.91 / +6.93

Table 2. Improved mean-field inference with learned potentials. (top)
Average IoU score on Pascal VOC12 validation/test data [20] for
semantic segmentation; (bottom) Accuracy for all pixels / averaged over
classes on the MINC test data [11] for material segmentation.

We evaluate the result after 1 step and 2 steps of mean-field inference
and compare the Gaussian filter versus the learned version (cf.
Tab. 2). First, as in [16] we observe that one step of mean field
improves the performance by 2.48% in Intersection over Union (IoU)
score. However, a learned potential increases the score by 2.93%. The
same behaviour is observed for 2 steps: the learned result again adds
on top of the raised Gaussian mean field performance. Further, we
tested a variant of the mean-field model that learns a separate kernel
for the first and second step [37]. This “loose” mean-field model leads
to further improvement of the performance. It is not obvious how to
take advantage of a loose model in the case of Gaussian potentials.

5.3.2 Material Segmentation 

We adopt the method and dataset from [11] for the material segmentation
task. Their approach proposes the same architecture as in the previous
section: a CNN to predict the material labels (e.g., wool, glass, sky,
etc.) followed by a densely connected CRF using Gaussian potentials and
mean-field inference. We re-use their pre-trained CNN that is publicly
available and further choose the CRF parameters and Lab color/position
features as in [11]. We compare the Gaussian pairwise potentials versus
learned potentials with the same training procedure as for semantic
segmentation. Results for pixel accuracy and class-averaged pixel
accuracy are shown in Table 2. Following the CRF validation in [11], we
ignored the ‘other’ label for both training and evaluation. For this
dataset, the availability of training data is small, since there are
928 images that are only sparsely annotated with segments. While that
is enough to cross-validate parameters for Gaussian CRFs, we would
expect the general bilateral convolution to benefit from more training
data. Especially for class-averaged results we anticipate a higher
increase in performance, which we believe is hindered by the size of
the training set. Visual results with annotations are shown in Fig. 4
and more are included in Sec. C.6.

(a) Sample tile images (b) NN architecture (c) IoU versus Epochs

Figure 5. Segmenting Tiles. (a) Example tile input images; (b) the
3-layer NN architecture used in experiments. “Conv” stands for spatial
convolutions, resp. bilateral convolutions; (c) Training progress in
terms of validation IoU versus training epochs.

6. Bilateral Neural Networks 

Probably the most promising opportunity for the generalized bilateral
filter is its use in Convolutional Neural Networks. Since we are not
restricted to the Gaussian case anymore, we can run several filters
sequentially in the same way as filters are ordered in layers in
typical spatial CNN architectures. Having the gradients available
allows for end-to-end training with backpropagation, without the need
for any change in CNN training protocols. We refer to the layers of
bilateral filters as “bilateral convolution layers” (BCL). As discussed
in the introduction, these can be understood as either linear filters
in a high dimensional space or a filter with an image adaptive
receptive field. In the remainder we will refer to CNNs that include at
least one bilateral convolutional layer as a bilateral neural network
(BNN).

What are the possibilities of a BCL compared to a standard spatial
layer? First, we can choose a feature space f_i ∈ R^d that defines
proximity between the elements on which the convolution is performed.
This can include color or intensity as in the previous example.
Second, since the permutohedral lattice convolution takes advantage
of hashing, it is advantageous for data in a sparse format. We
performed a runtime comparison between our implementation of a BCL¹
and the caffe [30] implementation of a d-dimensional convolution. For
2D positional features (first row), the standard layer is faster
since the permutohedral algorithm comes with an overhead. For higher
dimensions d > 2, the runtime depends on the sparsity, but ignoring
the sparsity quickly leads to intractable runtimes. The permutohedral
lattice convolution was designed especially for this purpose,
although with Gaussian filtering in mind.

¹We have identified several possible speed-ups that are not taken
advantage of yet.


Dim.-Features | d-dim caffe (CPU / GPU) | BCL (CPU / GPU)
2D-(x, y) | 3.3 ± 0.3 / 0.5 ± 0.1 | 4.8 ± 0.5 / 2.8 ± 0.4
3D-(r, g, b) | 364.5 ± 43.2 / 12.1 ± 0.4 | 5.1 ± 0.7 / 3.2 ± 0.4
4D-(x, r, g, b) | 30741.8 ± 9170.9 / 1446.2 ± 304.7 | 6.2 ± 0.7 / 3.8 ± 0.5
5D-(x, y, r, g, b) | out of memory | 7.6 ± 0.4 / 4.5 ± 0.4

Table 3. Runtime comparison: BCL vs. spatial convolution. Average
CPU/GPU runtime (in ms) of 50 1-neighborhood filters averaged over
1000 images from Pascal VOC. All scaled features
(x, y, r, g, b) ∈ [0, 50). BCL includes splatting and slicing
operations which in layered networks can be re-used.

Next we illustrate two use cases of BNNs and compare against spatial
CNNs. Section B.1 contains further explanatory experiments with
examples on MNIST digit recognition.

6.1. An Illustrative Example: Segmenting Tiles 

In order to highlight the modeling possibilities of using higher
dimensional sparse feature spaces for convolutions through BCLs, we
constructed the following illustrative problem. A randomly colored
foreground tile of size 20 × 20 is placed on a randomly colored
background of size 64 × 64, and Gaussian noise with std. dev. 0.02 is
added, with color values normalized to [0, 1]. The task is to segment
out the smaller tile. Some example images are shown in Fig. 5(a). A
pixel classifier cannot distinguish foreground from background since
the color is random. We use a three layer CNN with convolutions
interleaved with ReLU functions. The schematic of the architecture is
shown in Fig. 5(b) (32, 16, 2 filters). We train standard CNNs with
filters of size n × n, n ∈ {9, 13, 17, 21}. We create 10k training,
1k validation and 1k test images and use the validation set to choose
learning rates. In Fig. 5(c) we plot the validation IoU as a function
of training epochs. All CNNs (and BNNs) converge to almost perfect
test predictions.
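As a rough illustration of the tile data described above, the following minimal sketch generates one image and its ground-truth mask. The function name, the uniform placement of the tile, and the absence of clipping after adding noise are assumptions made for this example; the text does not specify them.

```python
import numpy as np

def make_tile_image(rng, size=64, tile=20, noise_std=0.02):
    """Return (image, mask) for one synthetic tile-segmentation example."""
    background = rng.uniform(0.0, 1.0, size=3)  # random background color in [0, 1]
    foreground = rng.uniform(0.0, 1.0, size=3)  # random tile color in [0, 1]
    image = np.tile(background, (size, size, 1))  # shape (size, size, 3)
    # Assumption: the tile position is uniform over all placements that fit.
    top = rng.integers(0, size - tile + 1)
    left = rng.integers(0, size - tile + 1)
    image[top:top + tile, left:left + tile] = foreground
    mask = np.zeros((size, size), dtype=bool)
    mask[top:top + tile, left:left + tile] = True
    # Additive Gaussian noise; values are not clipped here (an assumption).
    image = image + rng.normal(0.0, noise_std, size=image.shape)
    return image, mask

rng = np.random.default_rng(0)
image, mask = make_tile_image(rng)
```

Because both colors are drawn independently, a classifier that looks at a single pixel has no cue for the label, which is the point the experiment relies on.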

Now, we replace all spatial convolutions with bilateral convolutions
for a full BNN. The features are f_i = (x_i, y_i, r_i, g_i, b_i)^T,
and the filter has a neighborhood of 1. The total number of
parameters in this network is around 40k, compared to 52k for a 9 × 9
CNN and up to 282k for a 21 × 21 CNN. With the same training protocol
and optimizer, the BNN converges much faster. In this example, as in
the semantic segmentation discussed in the last section, color is a
discriminative feature for the label. The bilateral convolutions
“see” the color difference: the points are already pre-grouped in the
permutohedral lattice and the task remains to assign a label to the
two groups.

Figure 6. Character recognition. (a) Sample Assamese character images
[7] (9 classes, 2 samples each); training progression of various
models with (b) LeNet and (c) DeepCNet base networks.
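The parameter counts quoted for the tile experiment can be reproduced with a short calculation. The sketch below assumes three input color channels, the (32, 16, 2) filter counts from Fig. 5(b), no bias terms, and a bilateral filter with 63 elements (the filter size listed for “Tiles: BNN” in Table 9); the helper name is hypothetical.

```python
def conv_net_weights(filter_elements, channels=(3, 32, 16, 2)):
    """Number of filter weights (biases ignored) in a stack of convolution layers."""
    pairs = zip(channels[:-1], channels[1:])
    return sum(c_in * c_out * filter_elements for c_in, c_out in pairs)

cnn_9 = conv_net_weights(9 * 9)     # spatial CNN with 9 x 9 filters
cnn_21 = conv_net_weights(21 * 21)  # spatial CNN with 21 x 21 filters
bnn = conv_net_weights(63)          # bilateral filters with 63 elements each

print(cnn_9, cnn_21, bnn)
```

Under these assumptions the counts come out at 51840, 282240 and 40320, which round to the 52k, 282k and 40k stated in the text.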

6.2. Character Recognition 

The results for tile, semantic, and material segmentation when using
general bilateral filters mainly improved because the feature space
was used to encode useful prior information about the problem
(close-by pixels with similar RGB values have the same label). Such
prior knowledge is often available when structured predictions are to
be made, but the input signal may also be in a sparse format to begin
with. Let us consider handwritten character recognition, one of the
prime cases for CNN use.

The Assamese character dataset [7] contains 183 different Indo-Aryan
symbols with 45 writing samples per class. Some sample character
images are shown in Fig. 6(a). This dataset has been collected on a
tablet PC using a pen input device and has been pre-processed to
binary images of size 96 × 96. Only about 3% of the pixels contain a
pen stroke, which we will denote by v_i = 1.

A CNN is a natural choice to approach this classification task. We
experiment with two CNN architectures that have been used for this
task, LeNet-7 from [35] and DeepCNet [17, 23]. The LeNet is a
shallower network with bigger filter sizes whereas DeepCNet is deeper
with smaller convolutions. Both networks are fully specified in Sec.
C.4. In order to simplify the task for the networks we cropped the
characters by placing a tight bounding box around them and providing
the bounding boxes as input to the networks. We will call these
networks Crop-LeNet and Crop-DeepCNet, respectively. For training, we
randomly divided the data into 30 writers for training, 6 for
validation and the remaining 9 for test. Fig. 6(b) and Fig. 6(c) show
the training progress for various LeNet and DeepCNet models
respectively. DeepCNet is a better choice for this problem and, in
both cases, pre-processing the data by cropping improves convergence.

Split | LeNet | Crop-LeNet | BNN-LeNet | DeepCNet | Crop-DeepCNet | BNN-DeepCNet
Validation | 59.29 | 68.67 | 75.05 | 82.24 | 81.88 | 84.15
Test | 55.74 | 69.10 | 74.98 | 79.78 | 80.02 | 84.21

Table 4. Results on Assamese character images. Total recognition
accuracy for the different models.

The input is spatially sparse and the BCL provides a natural way to
take advantage of this. For both networks we create a BNN variant
(BNN-LeNet and BNN-DeepCNet) by replacing the first layer with a
bilateral convolution using the features f_i = (x_i, y_i)^T, and we
only consider the foreground points v_i = 1. The values (x_i, y_i)
denote the position of the pixel with respect to the top-left corner
of the bounding box around the character. In effect the lattice is
very sparse, which reduces runtime because the convolutions are only
performed on the 3% of the points that are actually observed. A
bilateral filter has 7 parameters, compared to a receptive field of
3 × 3 for the first DeepCNet layer and 5 × 5 for the first LeNet
layer. Thus, a BCL with the same number of filters has fewer
parameters. The result of the BCL convolution is then splatted at all
points (x_i, y_i) and passed on to the remaining spatial layers. The
convergence behaviour is shown in Fig. 6 and again we find faster
convergence and also better validation accuracy. The empirical
results of this experiment for all tested architectures are
summarized in Table 4, with BNN variants clearly outperforming their
spatial counterparts.

The absolute results can be vastly improved by making use of virtual
examples, e.g., by affine transformations [23]. The purpose of these
experiments is to compare the networks on equal grounds, while we
believe that additional data will be beneficial for both networks. We
have no reason to believe that a particular network benefits more.

7. Conclusion 

We proposed to learn bilateral filters from data. In hindsight, it
may appear obvious that this leads to performance improvements
compared to a particular chosen parametric form, e.g., the Gaussian.
The key insight is to understand the algorithms that facilitate fast
approximate computation of Eq. (1) as a parameterized implementation
of a bilateral filter with free parameters, which enables gradient
descent based learning. We relaxed the separability restriction of
the algorithm from [2] to allow for more general filter functions.
There is a wide range of possible applications for learned bilateral
filters [41]. In this paper we discussed some examples that
generalize previous work. These include joint bilateral upsampling,
inference in dense CRFs, and the use of bilateral convolutions in
neural networks. Across several diverse experiments, we observed
empirical improvements when learning bilateral filters.

The bilateral convolutional layer allows for filters whose receptive
fields change given the input image. The feature space view provides
a canonical way to encode similarity between any kind of objects, not
only pixels, but e.g., bounding boxes, segmentations, surfaces, etc.
The proposed filtering operation is then a natural candidate to
define filter convolutions on these objects: it takes advantage of
sparsity and scales to higher dimensions. Therefore, we believe that
this view will be useful for several problems where CNNs can be
applied. An open research problem is whether the sparse higher
dimensional structure also allows for efficient or compact
representations for intermediate layers inside CNN architectures.

Appendix

The appendix contains a more detailed overview of the permutohedral
lattice convolution in Section A, more experiments in Section B and
additional results with experiment protocols for the experiments
presented before in Section C.

A. General Permutohedral Convolutions 

A core technical contribution of this work is the generalization of
the Gaussian permutohedral lattice convolution proposed in [2] to the
full non-separable case with the ability to perform backpropagation.
Although, conceptually, there are only minor differences between
Gaussian and general parameterized filters, there are non-trivial
practical differences in terms of the algorithmic implementation. The
Gauss filters belong to the separable class and can thus be
decomposed into multiple sequential one dimensional convolutions. We
are interested in general filter convolutions, which cannot be
decomposed. Thus, performing a general permutohedral convolution at a
lattice point requires the computation of the inner product with the
neighboring elements in all the directions in the high-dimensional
space.

Here, we give more details of the implementation differences of
separable and non-separable filters. In the following we will explain
the scalar case first. Recall that the forward pass of a general
permutohedral convolution involves 3 steps: splatting, convolving and
slicing. We follow the same splatting and slicing strategies as in
[2] since these operations do not depend on the filter kernel. The
main difference between our work and the existing implementation of
[2] is the way that the convolution operation is executed. This
proceeds by constructing a blur neighbor matrix K that stores for
every lattice point all values of the lattice neighbors that are
needed to compute the filter output.

Figure 7. Illustration of 1D permutohedral lattice construction. A
4 × 4 (x, y) grid lattice is projected onto the plane defined by the
normal vector (1, 1)^T. This grid has s + 1 = 4 and d = 2, i.e.,
(s + 1)^d = 4^2 = 16 elements. In the projection, all points of the
same color are projected onto the same points in the plane. The
number of elements of the projected lattice is
t = (s + 1)^d − s^d = 4^2 − 3^2 = 7, that is, the (4 × 4) grid minus
the size of the lattice that is 1 smaller on each side, in this case
a (3 × 3) lattice (the upper right (3 × 3) elements).

The blur neighbor matrix is constructed by traversing through all the
populated lattice points and their neighboring elements. This is done
recursively to share computations. For any lattice point, the
neighbors that are n hops away are the direct neighbors of the points
that are n − 1 hops away. The size of a d dimensional spatial filter
with width s + 1 is (s + 1)^d (e.g., a 3 × 3 filter, s = 2 and d = 2,
has 3^2 = 9 elements) and this size grows exponentially in the number
of dimensions d. The permutohedral lattice is constructed by
projecting a regular grid onto the plane defined by the d dimensional
normal vector (1, ..., 1)^T. See Fig. 7 for an illustration of the 1D
lattice construction. Many corners of a grid filter are projected
onto the same point; in total t = (s + 1)^d − s^d elements remain in
the permutohedral filter with s neighborhood in d − 1 dimensions. If
the lattice has m populated elements, the matrix K has size t × m.
Note that, since the input signal is typically sparse, only a few
lattice corners are being populated in the splatting step. We use a
hash-table to keep track of these points and traverse only through
the populated lattice points for this neighborhood matrix
construction.
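The element count of the projected filter can be checked by brute force. The following sketch enumerates a regular grid with s + 1 points per side in d dimensions, merges all points that coincide after projection along (1, ..., 1), and compares the result with (s + 1)^d − s^d as in Fig. 7. The function names are hypothetical and the enumeration is only meant for small s and d.

```python
from itertools import product

def projected_filter_size(s, d):
    """Count distinct points after projecting an (s+1)^d grid along (1, ..., 1)."""
    projected = set()
    for point in product(range(s + 1), repeat=d):
        # Two grid points coincide after the projection exactly when they differ
        # by a multiple of (1, ..., 1); subtracting the smallest coordinate gives
        # one canonical representative per projected point.
        lowest = min(point)
        projected.add(tuple(coord - lowest for coord in point))
    return len(projected)

def closed_form(s, d):
    """Closed-form count t = (s + 1)^d - s^d."""
    return (s + 1) ** d - s ** d

for s, d in [(3, 2), (1, 3), (2, 4), (1, 6), (2, 6)]:
    print(s, d, projected_filter_size(s, d), closed_form(s, d))
```

Reading d as one more than the number of features, the same formula gives the filter sizes quoted elsewhere in the text, for example 7 for 2D position features with a 1-neighborhood and 65 for 3D features with a 2-neighborhood.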

Once the blur neighbor matrix K is constructed, we can perform the
convolution by the matrix vector multiplication

I' = BK, (8)

where B is the 1 × t filter kernel (whose values we will learn) and
I' ∈ R^{1×m} is the result of the filtering at the m lattice points.
In practice, we found that the matrix K is sometimes too large to fit
into GPU memory and we divided the matrix K into smaller pieces to
compute Eq. 8 sequentially.
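A minimal sketch of the scalar case of Eq. 8, evaluated piece by piece, is given below. How K is actually partitioned is not stated in the text; splitting along the m lattice points (columns of K) and the chunk size are assumptions of this example, and random data stands in for a real blur neighbor matrix.

```python
import numpy as np

def filter_lattice(B, K, chunk=1024):
    """Compute I' = B K one block of lattice points (columns of K) at a time."""
    t, m = K.shape
    out = np.empty((1, m), dtype=np.result_type(B, K))
    for start in range(0, m, chunk):
        stop = min(start + chunk, m)
        # Each block only needs a (t x chunk) slice of K in memory.
        out[:, start:stop] = B @ K[:, start:stop]
    return out

rng = np.random.default_rng(0)
t, m = 7, 5000                  # 7 filter elements, 5000 populated lattice points
B = rng.normal(size=(1, t))     # filter kernel (the learned parameters)
K = rng.normal(size=(t, m))     # stand-in for the blur neighbor matrix
result = filter_lattice(B, K, chunk=512)
```

Since every column of K is processed independently, the chunked result is identical to the single product B @ K, and the gradient with respect to B can be accumulated over the same chunks.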

In the general multi-dimensional case, the signal I is of c
dimensions. Then the kernel B is of size c × t and K stores the c
dimensional vectors accordingly. When the input and output points are
different, we splat only the input points and slice only at the
output points.

B. Additional Experiments 

In this section we discuss more use-cases for the learned 
bilateral filters, one use-case of BNNs and two single filter 
applications for image and 3D mesh denoising. 

B.1. Recognition of subsampled MNIST

One of the strengths of the proposed filter convolution is that it
does not require the input to lie on a regular grid. The only
requirement is to define a distance between features of the input
signal. We highlight this feature with the following experiment using
the classical MNIST ten class classification problem [36]. We sample
a sparse set of N points (x, y) ∈ [0, 1] × [0, 1] uniformly at random
in the input image, use their interpolated values as signal and the
continuous (x, y) positions as features. This mimics sub-sampling of
a high-dimensional signal. To compare against a spatial convolution,
we interpolate the sparse set of values at the grid positions.
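The sampling step described above can be sketched as follows. The mapping of [0, 1] × [0, 1] onto pixel centers and the use of bilinear interpolation for the signal values are assumptions of this example (the text only says “interpolated values”), and the function name is hypothetical.

```python
import numpy as np

def sample_sparse_signal(image, n_points, rng):
    """Sample positions in [0, 1]^2 and bilinearly interpolate a gray-scale image."""
    h, w = image.shape
    xy = rng.uniform(0.0, 1.0, size=(n_points, 2))  # continuous (x, y) features
    # Assumption: x = 0 / x = 1 map to the first / last pixel column (same for y).
    col = xy[:, 0] * (w - 1)
    row = xy[:, 1] * (h - 1)
    c0 = np.floor(col).astype(int)
    r0 = np.floor(row).astype(int)
    c1 = np.minimum(c0 + 1, w - 1)
    r1 = np.minimum(r0 + 1, h - 1)
    fc = col - c0  # fractional offsets inside the pixel cell
    fr = row - r0
    values = ((1 - fr) * (1 - fc) * image[r0, c0]
              + (1 - fr) * fc * image[r0, c1]
              + fr * (1 - fc) * image[r1, c0]
              + fr * fc * image[r1, c1])
    return xy, values

rng = np.random.default_rng(0)
ramp = np.tile(np.arange(28.0), (28, 1))  # test image whose value equals the column index
n_points = int(0.2 * ramp.size)           # e.g. 20% of the pixel count
xy, values = sample_sparse_signal(ramp, n_points, rng)
```

The pairs (xy, values) are what a bilateral convolution layer would consume directly as features and signal, while a spatial CNN would first need the values resampled onto the pixel grid.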

We take a reference implementation of LeNet [35] that is part of the
Caffe project [30] and compare it against the same architecture but
replacing the first convolutional layer with a bilateral convolution
layer (BCL). The filter size and numbers are adjusted to get a
comparable number of parameters (5 × 5 for LeNet, 2-neighborhood for
BCL).

The results are shown in Table 5. We see that training on the
original MNIST data (column Original, LeNet vs. BNN) leads to a
slight decrease in performance of the BNN (99.03%) compared to LeNet
(99.19%). The BNN can be trained and evaluated on sparse signals, and
we resample the image as described above for N = 100%, 60% and 20% of
the total number of pixels. The methods are also evaluated on test
images that are subsampled in the same way. Note that we can train
and test with different subsampling rates. We introduce an additional
bilinear interpolation layer for the LeNet architecture to train on
the same data. In essence, both models perform a spatial
interpolation and thus we expect them to yield a similar
classification accuracy. Once the data is of higher dimensions, the
permutohedral convolution will be faster due to hashing the sparse
input points, as well as less memory demanding in comparison to a
naive application of a spatial convolution with interpolated values.

Method | Original | Test 100% | Test 60% | Test 20%
LeNet | 0.9919 | 0.9660 | 0.9348 | 0.6434
BNN | 0.9903 | 0.9844 | 0.9534 | 0.5767
LeNet 100% | 0.9856 | 0.9809 | 0.9678 | 0.7386
BNN 100% | 0.9900 | 0.9863 | 0.9699 | 0.6910
LeNet 60% | 0.9848 | 0.9821 | 0.9740 | 0.8151
BNN 60% | 0.9885 | 0.9864 | 0.9771 | 0.8214
LeNet 20% | 0.9763 | 0.9754 | 0.9695 | 0.8928
BNN 20% | 0.9728 | 0.9735 | 0.9701 | 0.9042

Table 5. Classification accuracy on MNIST. We compare the LeNet [35]
implementation that is part of Caffe [30] to the network with the
first layer replaced by a bilateral convolution layer (BCL). Both are
trained on the original image resolution (first two rows). Three more
BNN and CNN models are trained with randomly subsampled images (100%,
60% and 20% of the pixels). The “Test” columns give the subsampling
rate used at test time. An additional bilinear interpolation layer
samples the input signal on a spatial grid for the CNN model.

B.2. Image Denoising 

The main application that inspired the development of the bilateral
filtering operation is image denoising [5], where a single Gaussian
kernel is used. Our development allows us to learn this kernel
function from data, and we explore how much a single but more general
bilateral filter improves the result.

We use the Berkeley segmentation dataset 
(BSDS500) [4] as a test bed. The color images in the 
dataset are converted to gray-scale, and corrupted with 
Gaussian noise with a standard deviation of 

We compare the performance of four different filter models on a
denoising task. The first baseline model (“Spatial” in Table 6, 25
weights) uses a single spatial filter with a kernel size of 5 and
predicts the scalar gray-scale value at the center pixel. The next
model (“Gauss Bilateral”) applies a bilateral Gaussian filter to the
noisy input, using position and intensity features f = (x, y, v)^T.
The third setup (“Learned Bilateral”, 65 weights) takes a Gauss
kernel as initialization and fits all filter weights on the “train”
image set to minimize the mean squared error with respect to the
clean images. Finally, we run a combination of spatial and
permutohedral convolutions on spatial and bilateral features
(“Spatial + Bilateral (Learned)”) to check for a complementary
performance of the two convolutions.

Figure 8. Sample data for 3D mesh denoising. (top) Some 3D body
meshes sampled from [38] and (bottom) the corresponding noisy meshes
used in denoising experiments.


Method | PSNR
Noisy Input | 20.17
Spatial | 26.27
Gauss Bilateral | 26.51
Learned Bilateral | 26.58
Spatial + Bilateral (Learned) | 26.65

Table 6. PSNR results of a denoising task using the BSDS500 dataset
[4].

The PSNR scores evaluated on full images of the “test” image set are
shown in Table 6. We find that an untrained Gaussian bilateral filter
already performs better than a trained spatial convolution (26.51
versus 26.27). A learned bilateral convolution further improves the
performance slightly. We chose this simple one-kernel setup to
validate an advantage of the generalized bilateral filter. A
competitive denoising system would employ RGB color information and
would also need to be properly adjusted in network size. Multi-layer
perceptrons have obtained state-of-the-art denoising results [14] and
the permutohedral lattice layer can readily be used in such an
architecture, which is intended future work.
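For reference, the PSNR metric reported in Table 6 can be computed as in the following sketch. It uses the standard definition 10 · log10(peak² / MSE); the peak value of 1.0 (intensities scaled to [0, 1]) is an assumption, since the text does not state the intensity range used.

```python
import numpy as np

def psnr(clean, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between a clean image and an estimate."""
    clean = np.asarray(clean, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((clean - estimate) ** 2)  # mean squared error over all pixels
    return 10.0 * np.log10(peak ** 2 / mse)

# A constant error of 0.1 on a [0, 1] image gives MSE = 0.01, i.e. 20 dB.
clean = np.zeros((8, 8))
noisy = clean + 0.1
score = psnr(clean, noisy)
```

The metric is a monotone function of the mean squared error, so minimizing the MSE loss during training directly maximizes the PSNR on the training set.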

B.3. 3D Mesh Denoising 

Permutohedral convolutions can naturally be extended to 
higher (> 2) dimensional data. To highlight this, we use the 
proposed convolution for the task of denoising 3D meshes. 












Figure 9. 4D isomap features for 3D human bodies. Visualiza¬ 
tion of 4D isomap features for a sample 3D mesh. Isomap feature 
values are overlaid onto mesh vertices. 


We sample 3D human body meshes using a generative 3D body model from
[38]. To the clean meshes, we add Gaussian random noise displacements
along the surface normal at each vertex location. Figure 8 shows some
sample 3D meshes sampled from [38] and corresponding noisy meshes.
The task is to take the noisy meshes as inputs and recover the
original 3D body meshes. We create 1000 training, 200 validation and
another 500 testing examples for the experiments.

Mesh Representation: The 3D human body meshes from [38] are
represented with 3D vertex locations and the edge connections between
the vertices. We found that this signal representation using global
3D coordinates is not suitable for denoising with bilateral
filtering. Therefore, we first smooth the noisy mesh using mean
smoothing applied to the face normals [53] and represent the noisy
mesh vertices as 3D vector displacements with respect to the
corresponding smoothed mesh. Thus, the task becomes denoising the 3D
vector displacements with respect to the smoothed mesh.



Metric | Noisy Mesh | Normal Smoothing | Gauss Bilateral | Learned Bilateral
Vertex Distance (RMSE) | 5.774 | 3.183 | 2.872 | 2.825
Normal Angle Error | 19.680 | 19.707 | 19.357 | 19.207

Table 7. Body Denoising. Vertex distance RMSE values and normal angle
error (in degrees) corresponding to different denoising strategies
averaged over 500 test meshes.



Figure 10. Sample Denoising Result. Ground truth mesh (left),
corresponding given noisy mesh (middle) and the denoised result
(right) using the learned bilateral filter.

Isomap Features: To apply the permutohedral convolution, we need to
define features at each input vertex point. We use a 4 dimensional
isomap embedding [49] of the given 3D mesh graph as features. The
given 3D mesh is converted into a weighted edge graph with edge
weights set to the Euclidean distance between connected vertices and
to infinity between non-connected vertices. Then the 4 dimensional
isomap embedding is computed for this weighted edge graph using a
publicly available implementation [48]. Fig. 9 shows the
visualization of isomap features on a sample 3D mesh.

Experimental Results: Mesh denoising with a bilateral filter proceeds
by splatting the input 3D mesh vectors (displacements with respect to
the smoothed mesh) into the 4D isomap feature space, filtering the
signal in this 4D space and then slicing back into the original 3D
input space. Table 7 shows quantitative results as RMSE for different
denoising strategies. The normal smoothing [53] already reduces the
RMSE. The Gauss bilateral filter results in a significant improvement
over normal smoothing, and learning the filter weights again improves
the result. A visual result is shown in Figure 10.
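The isomap features used in this section can be illustrated with a small self-contained sketch: geodesic distances on the weighted mesh graph are computed with Floyd-Warshall, followed by classical multidimensional scaling. This is not the implementation from [48]; the function name, the dense O(n³) shortest-path step, and the requirement of a connected mesh are assumptions that only suit small examples.

```python
import numpy as np

def isomap_embedding(vertices, edges, n_components=4):
    """Embed mesh vertices so that distances approximate geodesic graph distances."""
    vertices = np.asarray(vertices, dtype=float)
    n = len(vertices)
    # Edge weights: Euclidean length for connected vertices, infinity otherwise.
    dist = np.full((n, n), np.inf)
    np.fill_diagonal(dist, 0.0)
    for i, j in edges:
        dist[i, j] = dist[j, i] = np.linalg.norm(vertices[i] - vertices[j])
    # Floyd-Warshall all-pairs shortest paths (assumes a connected graph).
    for k in range(n):
        dist = np.minimum(dist, dist[:, k:k + 1] + dist[k:k + 1, :])
    # Classical MDS on the geodesic distances via the double-centered Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# A bent path: the geodesic distance between the end points is 3,
# although their Euclidean distance is only sqrt(3).
verts = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
path_edges = [(0, 1), (1, 2), (2, 3)]
features = isomap_embedding(verts, path_edges, n_components=4)
```

Points that are close in this embedding are close along the surface rather than in 3D space, which is the notion of proximity a bilateral filter on a body mesh would want, for example to avoid mixing signals between a hand and the hip it rests on.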


C. Additional results 

This section contains more qualitative results for the experiments of
the main paper.

C.1. Lattice Visualization

Figure 11 shows sample lattice visualizations for different feature
spaces.

Figure 11. Visualization of the Permutohedral Lattice. Sample lattice
visualizations for different feature spaces (panels, left to right:
given image; 0.05·(x, y); 0.01·(x, y); 0.05·(r, g, b);
0.01·(r, g, b); 0.05·(x, y, r, g, b); 0.01·(x, y, r, g, b)). All
pixels falling in the same simplex cell are shown with the same
color. (x, y) features correspond to image pixel positions, and
(r, g, b) ∈ [0, 255] correspond to the red, green and blue color
values.

C.2. Color Upsampling

In addition to the experiments discussed in the main paper, we
performed a cross-factor analysis of training and testing at
different upsampling factors. Table 8 shows the PSNR results for this
analysis. Although, in terms of PSNR, it is optimal to train and test
at the same upsampling factor, the differences are small when the
training and testing upsampling factors are different. Some images of
the upsampling for the Pascal VOC12 dataset are shown in Fig. 12. It
is especially the low level image details that are better preserved
with a learned bilateral filter compared to the Gaussian case.

Train Factor | Test 2x | Test 4x | Test 8x | Test 16x
2x | 38.45 | 36.12 | 34.06 | 32.43
4x | 38.40 | 36.16 | 34.08 | 32.47
8x | 38.40 | 36.15 | 34.08 | 32.47
16x | 38.26 | 36.13 | 34.06 | 32.49

Table 8. Color Upsampling with different train and test upsampling
factors. PSNR values corresponding to different upsampling factors
used at train and test times on the 2 megapixel image dataset, using
our learned bilateral filters.

C.3. Depth Upsampling 

Figure 13 presents some more qualitative results comparing bicubic
interpolation, Gauss bilateral and learned bilateral upsampling on
images from the NYU depth dataset [43].


C.4. Character Recognition 

Figure 14 shows the schematic of different layers of 
the network architecture for LeNet-7 [36] and DeepCNet(5, 
50) [17, 23]. For the BNN variants, the first layer filters are 
replaced with learned bilateral filters and are learned end- 
to-end. 

C.5. Semantic Segmentation 

Some more visual results for semantic segmentation are shown in
Figure 15. These include the underlying DeepLab CNN [16] result
(DeepLab), the 2 step mean-field result with Gaussian edge potentials
(+2stepMF-GaussCRF) and the corresponding results with learned edge
potentials (+2stepMF-LearnedCRF). In general, we observe that
mean-field inference in the learned CRF leads to slightly dilated
classification regions in comparison to the Gaussian CRF, thereby
filling in false negative pixels and also correcting some
mis-classified regions.

























(a) Input (b) Gray Guidance (c) Ground Truth (d) Bicubic Interpolation (e) Gauss Bilateral (f) Learned Bilateral 

Figure 12. Color Upsampling. Color 8x upsampling results using different methods (best viewed on screen). 



(a) Input (b) Guidance (c) Ground Truth (d) Bicubic Interpolation (e) Gauss Bilateral (f) Learned Bilateral 

Figure 13. Depth Upsampling. Depth 8x upsampling results using different upsampling strategies. 


C.6. Material Segmentation 

In Fig. 16, we present visual results comparing 2 step mean-field
inference with Gaussian and learned pairwise CRF potentials. In
general, we observe that the pixels belonging to dominant classes in
the training data are more accurately classified with the learned
CRF. This leads to a significant improvement in overall pixel
accuracy. It also results in a slight decrease in accuracy for pixels
of less frequent classes, thereby slightly reducing the average class
accuracy with learning. We attribute this to the type of annotation
that is available for this dataset, which is not for the entire image
but for some segments in the image. We have too few images of the
infrequent classes to combat this behaviour during training.

C.7. Experiment Protocols 

Table 9 shows experiment protocols of different experi¬ 
ments. 












































Figure 14. CNNs for Character Recognition. Schematic of (a, top)
LeNet-7 [36] and (b, bottom) DeepCNet(5, 50) [17, 23] architectures
used in Assamese character recognition experiments.



































































Class legend: Background, Aeroplane, Bicycle, Bird, Boat, Bottle,
Bus, Car, Cat, Chair, Cow, Dining Table, Dog, Horse, Motorbike,
Person, Potted Plant, Sheep, Sofa, Train, TV monitor.

Panels: (a) Input; (b) Ground Truth; (c) DeepLab;
(d) +2stepMF-GaussCRF; (e) +2stepMF-LearnedCRF.

Figure 15. Semantic Segmentation. Example results of semantic
segmentation. (c) depicts the unary results before application of MF,
(d) after two steps of MF with Gaussian edge CRF potentials, (e)
after two steps of MF with learned edge CRF potentials.




























Class legend: Brick, Carpet, Ceramic, Fabric, Foliage, Food, Glass,
Hair, Leather, Metal, Mirror, Other, Painted, Paper, Plastic,
Polished Stone, Skin, Sky, Stone, Tile, Wallpaper, Water, Wood.

Panels: (a) Input; (b) Ground Truth; (c) DeepLab;
(d) +2stepMF-GaussCRF; (e) +2stepMF-LearnedCRF.

Figure 16. Material Segmentation. Example results of material
segmentation. (c) depicts the unary results before application of MF,
(d) after two steps of MF with Gaussian edge CRF potentials, (e)
after two steps of MF with learned edge CRF potentials.


























Experiment | Feature Types | Feature Scales | Filter Size | Filter Nbr. | Train | Val. | Test | Loss Type | LR | Batch | Epochs

Single Bilateral Filter Applications
2x Color Upsampling | Position1, Intensity (3D) | 0.13, 0.17 | 65 | 2 | 10581 | 1449 | 1456 | MSE | 1e-06 | 200 | 94.5
4x Color Upsampling | Position1, Intensity (3D) | 0.06, 0.17 | 65 | 2 | 10581 | 1449 | 1456 | MSE | 1e-06 | 200 | 94.5
8x Color Upsampling | Position1, Intensity (3D) | 0.03, 0.17 | 65 | 2 | 10581 | 1449 | 1456 | MSE | 1e-06 | 200 | 94.5
16x Color Upsampling | Position1, Intensity (3D) | 0.02, 0.17 | 65 | 2 | 10581 | 1449 | 1456 | MSE | 1e-06 | 200 | 94.5
Depth Upsampling | Position1, Color (5D) | 0.05, 0.02 | 665 | 2 | 795 | 100 | 654 | MSE | 1e-07 | 50 | 251.6
Mesh Denoising | Isomap (4D) | 46.00 | 63 | 2 | 1000 | 200 | 500 | MSE | 100 | 10 | 100.0

DenseCRF Applications
Semantic Segmentation
- 1step MF | Position1, Color (5D); Position1 (2D) | 0.01, 0.34; 0.34 | 665; 19 | 2; 2 | 10581 | 1449 | 1456 | Logistic | 0.1 | 5 | 1.4
- 2step MF | Position1, Color (5D); Position1 (2D) | 0.01, 0.34; 0.34 | 665; 19 | 2; 2 | 10581 | 1449 | 1456 | Logistic | 0.1 | 5 | 1.4
- loose 2step MF | Position1, Color (5D); Position1 (2D) | 0.01, 0.34; 0.34 | 665; 19 | 2; 2 | 10581 | 1449 | 1456 | Logistic | 0.1 | 5 | +1.9
Material Segmentation
- 1step MF | Position2, Lab-Color (5D) | 5.00, 0.05, 0.30 | 665 | 2 | 928 | 150 | 1798 | Weighted Logistic | 1e-04 | 24 | 2.6
- 2step MF | Position2, Lab-Color (5D) | 5.00, 0.05, 0.30 | 665 | 2 | 928 | 150 | 1798 | Weighted Logistic | 1e-04 | 12 | +0.7
- loose 2step MF | Position2, Lab-Color (5D) | 5.00, 0.05, 0.30 | 665 | 2 | 928 | 150 | 1798 | Weighted Logistic | 1e-04 | 12 | +0.2

Neural Network Applications
Tiles: CNN-9 × 9 | - | - | 81 | 4 | 10000 | 1000 | 1000 | Logistic | 0.01 | 100 | 500.0
Tiles: CNN-13 × 13 | - | - | 169 | 6 | 10000 | 1000 | 1000 | Logistic | 0.01 | 100 | 500.0
Tiles: CNN-17 × 17 | - | - | 289 | 8 | 10000 | 1000 | 1000 | Logistic | 0.01 | 100 | 500.0
Tiles: CNN-21 × 21 | - | - | 441 | 10 | 10000 | 1000 | 1000 | Logistic | 0.01 | 100 | 500.0
Tiles: BNN | Position1, Color (5D) | 0.05, 0.04 | 63 | 1 | 10000 | 1000 | 1000 | Logistic | 0.01 | 100 | 30.0
LeNet | - | - | 25 | 2 | 5490 | 1098 | 1647 | Logistic | 0.1 | 100 | 182.2
Crop-LeNet | - | - | 25 | 2 | 5490 | 1098 | 1647 | Logistic | 0.1 | 100 | 182.2
BNN-LeNet | Position2 (2D) | 20.00 | 7 | 1 | 5490 | 1098 | 1647 | Logistic | 0.1 | 100 | 182.2
DeepCNet | - | - | 9 | 1 | 5490 | 1098 | 1647 | Logistic | 0.1 | 100 | 182.2
Crop-DeepCNet | - | - | 9 | 1 | 5490 | 1098 | 1647 | Logistic | 0.1 | 100 | 182.2
BNN-DeepCNet | Position2 (2D) | 40.00 | 7 | 1 | 5490 | 1098 | 1647 | Logistic | 0.1 | 100 | 182.2

Table 9. Experiment Protocols. Experiment protocols for the different experiments presented in this work. Feature Types: Feature
spaces used for the bilateral convolutions. Position1 corresponds to un-normalized pixel positions whereas Position2 corresponds to pixel
positions normalized to [0, 1] with respect to the given image. Feature Scales: Cross-validated scales for the features used. Filter Size:
Number of elements in the filter that is being learned. Filter Nbr.: Half-width of the filter. Train, Val. and Test correspond to the number
of train, validation and test images used in the experiment. Loss Type: Type of loss used for back-propagation. “MSE” corresponds to
Euclidean mean squared error loss and “Logistic” corresponds to multinomial logistic loss. “Weighted Logistic” is the class-weighted
multinomial logistic loss. We weighted the loss with inverse class probability for the material segmentation task due to the small availability
of training data with class imbalance. LR: Fixed learning rate used in stochastic gradient descent. Batch: Number of images used in one
parameter update step. Epochs: Number of training epochs. In all the experiments, we used a fixed momentum of 0.9 and a weight decay of
0.0005 for stochastic gradient descent. “Color Upsampling” experiments in this table correspond to those performed on Pascal VOC12
dataset images. For all experiments using Pascal VOC12 images, we use the extended training segmentation dataset available from [25], and
used the standard validation and test splits from the main dataset [20].







